Racial underrepresentation in dermatological datasets leads to biased machine learning models and inequitable healthcare

Objective: Machine learning is a promising clinical tool for improving patient outcomes by assisting with diagnosis, treatment, and risk-factor analysis for screening. Clinical applications are especially prominent in dermatology because many diseases and conditions present visually. After training on clinical datasets, a machine learning model can analyze and diagnose conditions using patient images and data from electronic health records (EHRs), but this training can also introduce bias. Despite its promise, artificial intelligence can exacerbate existing demographic disparities in healthcare if models are trained on biased datasets. Methods: Through a systematic review of the available literature, we highlight the extent of bias present in clinical datasets and the implications it could have for healthcare if not addressed. Results: We find that the implications are worse for dermatological models. Despite the severity and complexity of melanoma and other dermatological diseases, and despite disease presentations that differ with skin color, many imaging datasets underrepresent certain demographic groups, so machine learning models train primarily on images of fair-skinned individuals and leave minorities behind. Conclusion: To address this disparity, research first needs to investigate the extent of the bias present and the implications it may have for equitable healthcare.


Introduction
Modern advances in artificial intelligence are remarkable, with trained machine learning models being applied to some of humanity's greatest problems. In addition to the many technical problems machine learning models are attempting to solve, there is a large body of research on possible clinical applications of machine learning [1][2][3][4][5]. Prediction models can be applied to many clinical problems and can function as promising tools for diagnosis, risk analysis, and patient treatment [6][7][8][9][10]. Clinical applications of machine learning are no longer a goal for the future. Recently, artificial intelligence has been successfully implemented in many studies to improve clinical outcomes during the COVID-19 pandemic and to address other pressing medical issues [11][12][13][14]. The potential advantages of supervised machine learning are substantial: improved patient outcomes due to automated screening, assistance for medical providers, and greater efficiency are just a few [15,16]. After training on clinically relevant datasets and electronic health records, some supervised machine learning models have even performed better than licensed physicians [8,17].
Along with these advantages, there are many risks associated with using artificial intelligence for clinical applications [18]. One of the most prominent concerns is a biased impact on healthcare. Healthcare systems universally struggle to minimize racial, ethnic, and other demographic disparities [19,20], and the field of dermatology is no exception [21]. Because skin color differs markedly among patients, dermatological machine learning models carry a high risk of worsening healthcare disparities when groups are underrepresented in training datasets. Dermatological models that learn from biased datasets risk biased performance that may leave minorities behind. To inform future artificial intelligence implementation by providers and researchers, it is necessary to investigate and address the extent to which machine learning may exacerbate racial and other demographic disparities in dermatology [22]. This review explores the extent, causes, and possible solutions of bias in dermatological machine learning models, as well as the overall literature coverage of the topic.

Methods
We searched for and selected literature using the PubMed Advanced Search Builder and MeSH keyword queries. To investigate the extent of bias in clinical applications of artificial intelligence, we used the advanced keyword query "machine learning" AND "bias" AND "healthcare", which yielded 112 results, of which 26 were selected for further review. To refine the search to literature on bias specifically in dermatological applications of machine learning, we also searched for "dermatology" AND "machine learning" AND "bias" and obtained 11 results, of which 6 were selected for review. The advanced keyword query "dermatology" AND "machine learning" AND "diversity" was also used to investigate underrepresentation and overrepresentation in dermatological datasets and models. This search yielded 7 additional results, of which 2 were selected for further review. Lastly, to gain insight into the bias that such underrepresentation and overrepresentation may introduce during model training specifically, we searched for "machine learning training" AND "diversity" AND "bias", which returned 9 results, of which 4 were selected for further review.
Inclusion criteria were strict: eligible sources were clinical trials (including randomized controlled trials), meta-analyses, review articles, and clinical artificial intelligence papers. Papers found through the systematic literature search were further assessed for relevance by title and abstract. To keep the analysis focused, only papers on clinical applications of machine learning or on the demographic disparities inherent in their use were included; texts concerned solely with machine learning development outside a clinical setting were excluded. All selected articles were written in English in order to avoid errors due to mistranslation. Strong emphasis was placed on selecting articles from reputable journals because of the quality control inherent in the peer-review process. Some reputable and widely used websites were also included for their relevance; however, non-peer-reviewed sources without a strong reputation were excluded. Reputability was determined primarily by the presence of a peer-review process. For sources without a peer-review process, reputability was determined by high web traffic and by citation counts in other scientific literature. Eligibility and reputability determinations were discussed, and disagreements were resolved openly by all authors. The search was conducted from January 10, 2022, to January 15, 2022.

Bias in clinical datasets
There is a large body of literature on bias in healthcare datasets [7,15,23,24]. Recently, much of this literature has focused on gender bias, leading to greater acknowledgement of the disparity present [25]. Despite this progress, bias in clinical datasets extends to many areas beyond gender and still affects a sobering majority of machine learning models [6,12,14]. A risk of bias (RoB) analysis assesses the features of a model and the dataset it is trained on in order to quantify the possible bias present [26]. In a RoB analysis of machine learning prediction models for COVID-19 diagnosis or prognosis by Adamidi et al., 97% (98/101) of the included models were designated as having a high overall risk of bias. The remaining three models were designated as having an unclear risk of bias, and no models were designated as having a low risk of bias [11]. This alarmingly high proportion of models at high risk of bias warrants a search for the cause. A further breakdown of this RoB analysis shows that, for participants specifically, 30% (30/101) of the studies had a high risk of bias and 21% (21/101) had an unclear risk of bias [11]. The high risk of bias in participant-related features suggests that the overall high risk of bias in these prediction models may be largely due to the representation of participants in the training datasets. Other reviews corroborate these results. Another RoB analysis of similar prediction models found that 100% (11/11) of the included models had a high overall risk of bias, with 18% (2/11) of studies having an unclear risk of bias with regard to participants [13].
Perhaps unsurprisingly, the high prevalence of biased datasets extends to dermatological datasets as well [27]. In a RoB analysis by Dick et al., 97% (128/132) of melanoma prediction models were designated as having a high risk of bias in at least one category, with 44% (58/132) having a high risk of bias in at least two categories [28]. As with non-dermatological models, the high risk of bias in many of these models is likely due to underrepresentation and overrepresentation of various groups in the training datasets, as evidenced by the high risk of bias in participant-related features. As one example from outside dermatology, the LFW dataset, a top-tier benchmark for face recognition, was estimated to consist of 77.5% (10258/13233) male faces and 83.5% (11045/13233) White faces [7,29]. Similarly, when the majority of images in a dermatological training dataset are of White patients, models trained on that dataset are highly likely to be biased towards successful prediction on White patients [30].
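The composition figures quoted above for the LFW dataset can be reproduced with a short calculation. The sketch below is illustrative only: the function name is hypothetical, and the only data used are the counts reported in the text.

```python
def share_percent(count, total):
    """Return the percentage of a dataset made up by one group, rounded to one decimal."""
    return round(100 * count / total, 1)

# Counts for the LFW face recognition benchmark, as quoted in the text [7,29].
LFW_TOTAL = 13233
male_share = share_percent(10258, LFW_TOTAL)
white_share = share_percent(11045, LFW_TOTAL)

print(f"male faces:  {male_share}%")
print(f"White faces: {white_share}%")
# The remainder is what is left for every other group combined.
print(f"all other genders combined: {round(100 - male_share, 1)}%")
print(f"all other races combined:   {round(100 - white_share, 1)}%")
```

Reporting the remainder alongside the majority share makes it clearer how few images are available for all other groups combined.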

Absence of demographical data
Although many of these models report high overall prediction success, the breakdown of successful predictions by patient race is not shown for many dermatological datasets [7]. This is a further problem because high overall performance metrics and prediction rates may misrepresent model accuracy on specific patient groups. If models are deemed successful enough for clinical use on the basis of overall performance metrics, these tools may improve clinical outcomes only for certain racial groups and leave minorities behind, or may even decrease clinical accuracy by misinforming providers [7,12,13,17].
Furthermore, many databases and prediction model studies do not report demographic information at all. In the national health databases of countries such as France and Canada, patient ethnicity or race is not reported [17]. According to Gupta and Kataraya, a large amount of training data is sourced from social media in order to create clinical datasets. Social media contains a great deal of data, which leads to large training sets; however, race, ethnicity, age, gender, and other demographic information is not always explicitly supplied or accurate, resulting in datasets with no available information on their own representation [3]. This is a serious problem because underrepresentation or overrepresentation of groups within the dataset cannot be easily identified, which masks the possible risk of bias.
As a result, the complete extent of racial underrepresentation in dermatological datasets is still unknown. This is partly because many dermatological datasets do not disclose the skin types appearing in their source images. In a systematic review by Guo et al., only 8.82% (12/136) of studies disclosed the race or ethnicity of participants in the source images of the dermatological datasets, and only 4.41% (6/136) disclosed information on skin type [31]. Identifying the overall risk of bias in these datasets and models is essential to addressing the disparities present in existing machine learning approaches [32].
The widespread non-reporting of race and ethnicity in dermatological datasets could be due to the extreme underrepresentation present [27]. Of the same 136 studies, only 2 included Hispanic individuals, only 1 explicitly included Black patients, and only 1 explicitly included American Indian or Alaska Native patients [31]. Further analysis of these studies showed that the participant images used to train models were collected primarily in the United States and Europe, as well as Australia and various East Asian countries, with only 1.47% (2/136) of studies including individuals from South America or Africa [31]. Guo et al. also draw attention to several popular image repositories that show the same underrepresentation; one example is the International Skin Imaging Collaboration, whose data is collected predominantly from Europe and Australia, where populations are primarily light-skinned [31,33].
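A minimal sketch can illustrate how an aggregate metric masks subgroup performance. The records, group labels, and counts below are invented purely for illustration and are not taken from any study cited here; the function name is likewise hypothetical.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """Return (overall_accuracy, {group: accuracy}) for (group, y_true, y_pred) records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, y_true, y_pred in records:
        total[group] += 1
        correct[group] += int(y_true == y_pred)
    overall = sum(correct.values()) / sum(total.values())
    per_group = {g: correct[g] / total[g] for g in total}
    return overall, per_group

# Hypothetical evaluation set: 90 images of light skin, 10 images of dark skin.
records = (
    [("light", 1, 1)] * 86 + [("light", 1, 0)] * 4
    + [("dark", 1, 1)] * 4 + [("dark", 1, 0)] * 6
)

overall, per_group = accuracy_by_group(records)
print(f"overall accuracy: {overall:.2f}")
for group, acc in per_group.items():
    print(f"{group}: {acc:.2f}")
```

In this made-up example the overall accuracy is 0.90 even though accuracy on the smaller group is only 0.40, which is the kind of gap that goes unseen when results are not broken down by patient group.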

Black box algorithms
Like the unquantifiable risk of bias in clinical datasets that lack demographic data, the risk of bias introduced through training machine learning models is difficult to discern [16,17]. As artificial intelligence research advances, machine learning models are increasing in complexity and are making predictions using features and connections in ways that are unclear even to their developers [34]. It is important to recognize the inherent risk in using these "black box" algorithms that cannot be completely understood. Kelly et al. highlight the importance of transparent and trustworthy model decision-making, as clinical settings require explainable and methodical approaches [16]. Unfortunately, model performance and model transparency tend to be inversely related, as the best-performing models are usually the most complex. Kelly et al. advise that much greater caution or information is needed in order to use these "black box" algorithms in a clinical setting [16].

Data collection
To best address the high prevalence of bias in models, it is helpful to understand how such biases arise in the datasets they are trained on [24]. One of the most prominent sources of bias is the method used to collect data for a training set. Data collection can take many forms, each of which can introduce bias in different ways. Randomized controlled trials (RCTs) attempt to be unbiased but often have inclusion and exclusion criteria that dramatically decrease the representativeness of the data they obtain [32]. In one example, an RCT on asthma treatment, 94% of adults with asthma would not have met the inclusion criteria for the study [17]. Datasets built from electronic health records, such as the MIMIC-III dataset, primarily include data from patients who visit the corresponding intensive care units or emergency departments. This also introduces bias, as White patients are more likely to have access to these healthcare resources, while Black or Hispanic patients are less likely to receive care at these locations [17]. Underrepresentation of many other groups, such as undocumented immigrants and low-income nationals, is also partly due to their lack of access to the sources of data collection. Racial disparities in access are then transferred to healthcare datasets, as the datasets are created with the same disparate proportions [17]. Another consideration is data variability. Although dermatological images are relatively easy to collect compared with other medical images, they are widely varied and the least standardized [35]. When training models on dermatological datasets, functionality needs to be added to address this large image variety, especially the variety arising from demographic factors such as skin color [36].

Missing data
Bias can also be introduced after data collection, as even parsing through data can introduce bias. In electronic health record datasets, many incomplete EHRs reduce the usable size of the training set. To address this, many studies filter the dataset for only "complete" EHRs. According to a study by Weber et al., even this simple filtering to manage missing data introduced a bias towards older female patients [37]. If such a simple filter can introduce gender bias, it is important to investigate the many similar filtering and data wrangling techniques that are common in data science, using recent research advances to identify and address the biases they may introduce [38].
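The effect of complete-case filtering on dataset composition can be checked by comparing group shares before and after the filter. The sketch below uses invented records in which missing values are more common for younger male patients; the field names, counts, and helper functions are assumptions made for illustration and do not reproduce the analysis of Weber et al.

```python
from collections import Counter

def group_shares(rows, key):
    """Return the fraction of rows belonging to each value of `key`."""
    counts = Counter(row[key] for row in rows)
    n = sum(counts.values())
    return {group: count / n for group, count in counts.items()}

def complete_cases(rows):
    """Keep only rows in which no field is missing (None)."""
    return [row for row in rows if all(value is not None for value in row.values())]

# Hypothetical EHR extract: the lab value is missing more often for younger men.
rows = (
    [{"sex": "F", "age": 70, "lab": 1.0}] * 40
    + [{"sex": "F", "age": 30, "lab": None}] * 10
    + [{"sex": "M", "age": 65, "lab": 1.2}] * 20
    + [{"sex": "M", "age": 35, "lab": None}] * 30
)

before = group_shares(rows, "sex")
after = group_shares(complete_cases(rows), "sex")
print("share by sex before filtering:", before)
print("share by sex after filtering: ", after)
```

In this made-up extract the sexes are balanced before filtering, but female patients make up two thirds of the complete cases, and every remaining record belongs to an older patient. Running a comparison of this kind after each filtering step is one simple way to notice such shifts.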

Duplicate data
Like missing data, duplicate data can also introduce bias into a dataset [11,39]. In large EHR datasets that are compiled from multiple sources, or that contain multiple patient encounters not indexed together, duplicates of a patient's EHR can mistakenly be added to the same training set. This can distort representation further, as duplicate EHRs inflate the share of some groups relative to others. Solutions for handling duplication bias are emerging, such as fold-stratified cross-validation, but further research is still needed to ensure these solutions do not introduce a different bias themselves [39].
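One simple way to limit the influence of duplicates is to remove exact repeats and to keep all records from the same patient in the same cross-validation fold. The sketch below shows this idea under stated assumptions: the field names `patient_id` and `encounter` and both helper functions are hypothetical, and this is a generic group-aware split rather than the specific fold-stratified method described in [39].

```python
import zlib

def deduplicate(records):
    """Drop records that repeat an already seen (patient_id, encounter) pair."""
    seen = set()
    unique = []
    for rec in records:
        key = (rec["patient_id"], rec["encounter"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

def assign_folds(records, n_folds=5):
    """Assign each record a fold index so that one patient never spans two folds."""
    folds = []
    for rec in records:
        # crc32 is stable across runs, unlike Python's built-in hash() for strings.
        folds.append(zlib.crc32(rec["patient_id"].encode("utf-8")) % n_folds)
    return folds

# Hypothetical records merged from two sources, with one exact duplicate.
records = [
    {"patient_id": "p1", "encounter": "e1"},
    {"patient_id": "p1", "encounter": "e1"},  # duplicate from a second source
    {"patient_id": "p1", "encounter": "e2"},
    {"patient_id": "p2", "encounter": "e1"},
]

unique = deduplicate(records)
folds = assign_folds(unique)
print(len(records), "records before,", len(unique), "after deduplication")
print("fold per record:", folds)
```

Hashing the patient identifier keeps the assignment deterministic, but it does not balance fold sizes or demographic composition, so the resulting folds should still be checked for representation.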

Synthetic data
To avoid the many possible sources of bias in collecting clinical data, researchers have developed methods of generating data. One alternative to training prediction models on real data is to use synthetic data (i.e., a large dataset created from real data in order to increase the amount of training material for a model). Synthetic data can be useful when training models; however, according to Bhanot et al., synthetic datasets are biased in much the same way as real datasets. Upon analysis, synthetic versions of three popular clinical datasets all showed considerable bias. The MIMIC-III dataset showed overrepresentation of White and Asian patients and underrepresentation of Black patients. In the ATUS dataset, which tracks the average sleep time of Americans, those who were 75 years or older or male were greatly underrepresented. In addition, the autism spectrum disorder (ASD) dataset greatly overrepresented White individuals and underrepresented Asian individuals [12].

Implications of biased datasets
The most impactful consequence of biased datasets is their effect on a model's diagnoses or other outputs [7,12,13,40]. According to Mpanya et al., the role demographics play in diagnosis should not be neglected. Many conditions have risk factors and presentations that depend on patient demographics. For example, in high-income regions such as Europe and the United States, heart failure is primarily caused by ischemic heart disease, while in lower-income regions such as sub-Saharan Africa, the predominant cause of heart failure is hypertension [4]. Clinical models trained on an unrepresentative dataset are more likely to have lower success rates on the underrepresented groups because they will not be sufficiently trained on these valuable connections [41]. Consequently, without a way to distinguish between different races or ethnicities, a model will attempt to diagnose using the demographic risk factors and other connections of the majority and will therefore not be as effective on underrepresented groups. This is especially true for dermatological datasets [42]. Skin lesions, rashes, and other dermatological conditions present very differently depending on skin color and other demographic factors in source images. Unless models are trained on a representative variety of skin types, ethnicities, and other factors, they will be unable to predict accurately on underrepresented groups, which will widen healthcare disparities [27]. In one dataset in which over 80% of the training images were of light-skinned individuals, prediction models trained on the dataset could not identify those with skin of color [14]. In another case, a facial recognition model trained on a biased dataset incorrectly classified 28 members of the US Congress as criminals and incorrectly classified 40% of the Congress members as persons of color when only 20% were [14].
When testing a neural network trained on almost 20,000 images of skin lesions against a second dataset that the model had not been trained on, Han et al. found that the model correctly diagnosed only 29% (29/100) of the lesions, indicating that dermatological models can be heavily biased towards the dataset and types of images they are trained on [43].

Implications on melanoma
The disparity these inaccuracies cause is amplified in dermatology, as patients with skin of color are already more likely to present with more complex dermatological diseases and have worse survival rates than White patients [16,44,45]. Ideally, machine learning could help mitigate these worse survival rates when used as a diagnostic aid. Machine learning, deep learning, and artificial intelligence have been used as tools in cases of pigmentary skin lesions and malignancies, psoriasis, acne, allergic contact dermatitis, autoimmune disorders, and ulcers [42]. Beyond these, arguably the most pressing dermatological area in which to consider the implications of machine learning is melanoma and other skin cancers. The high incidence and severity of melanoma, together with the improvement in outcomes from early diagnosis, increase the need for accurate machine learning models that can detect skin cancer early [28,31]. To achieve models that can be relied upon, biases must be identified and then removed or sufficiently mitigated so that all groups have access to potentially improved patient outcomes.

Conclusion
According to Pot et al., it is difficult to create completely unbiased datasets and models; however, as researchers and providers, we should strive to mitigate bias as much as possible moving forward [23]. To have the greatest effect, machine learning models must be continually retrained to reflect changing disease patterns [8,16,34,46]. Before this can be done, the extent of bias present needs to be identified in order to prevent further bias. Neglect of bias in training datasets and prediction models should not be acceptable [47]. At a minimum, racial, ethnic, and other demographic information should be disclosed with all clinical machine learning studies [48,49]. Given the potentially severe negative implications of bias, those creating training datasets and prediction models should take responsibility for ensuring that they are not contributing to increasing disparities in healthcare [9,47]. Furthermore, adding more diverse representation to existing datasets and then retraining models may mitigate bias; however, greater emphasis should be placed on developing reduced-bias datasets and models than on attempting to improve existing models [50].
Additionally, according to Dick et al., there is a high likelihood that progress in using machine learning for melanoma diagnosis is hindered by dermatologists feeling threatened by the technology [28]. When the physicians directly involved in applying new machine learning technology embrace it, they are in a position to detect disparities directly in its use. The benefits of using machine learning and artificial intelligence to augment physician diagnoses in dermatology can be better achieved if dermatologists and other healthcare providers view the new technology as a tool rather than as competition for their jobs [42]. Consequently, physicians should remain in control of the ultimate diagnosis, both to embrace and work with new advances in artificial intelligence and to keep them in check [15,24,34,49]. Despite the great risk of bias in dermatological imaging due to differing skin colors, we have identified a significant gap in the literature on addressing such bias in dermatology. More research should be done to identify the extent of bias present in dermatological models and datasets specifically in order to begin addressing the disparity present.
Finally, the value of successful applications of clinical models is difficult to overstate. By acknowledging, identifying, and removing the widespread racial biases in dermatological imaging datasets and corresponding machine learning models, the benefits of artificial intelligence can be realized by those of any demographic background and the large disparities in healthcare can be further addressed.